ABSTRACT

Standard univariate analyses of brain imaging data have revealed a host of structural and functional brain
alterations in schizophrenia. However, these analyses typically involve examining each voxel separately and making
inferences at group level, thus limiting clinical translation of their findings. Given that brain
alterations in schizophrenia extend over a widely distributed network of brain regions, univariate analysis
methods may not be the most suitable choice for imaging data analysis. To address these limitations, the
neuroimaging community has turned to machine learning methods, both because of their ability to examine voxels jointly
and because of their potential for making inferences at a single-subject level. This article provides a critical
overview of the current and foreseeable applications of machine learning in identifying imaging-based biomarkers
that could be used for the diagnosis, early detection and prediction of treatment response in schizophrenia, and
could thus be of high clinical relevance. We discuss promising future research directions and the main difficulties
facing the translation of machine learning methods into clinical practice.

© 2013 The Authors. Published by Elsevier Inc. All rights reserved. 
1. Introduction

Schizophrenia is a highly complex mental disorder characterized by
hallucinations, delusions, cognitive deficits and emotional disturbances.
The diagnosis of schizophrenia primarily relies upon identifying clinical
symptoms and the accurate assessment of behavioral signs
through interview with a medical specialist. Considering, however, 
the variety of clinical presentations of the disorder among patients, 
the symptomatic overlap with other disorders such as Bipolar Disor- 
der (Demirci and Calhoun, 2009) and the subjectivity involved in 
current psychiatric practice (Lawrie et al., 2011), reliable objective 
markers for diagnosing schizophrenia and related conditions are highly 
desirable. 

Over the past years, schizophrenia has been intensively studied using 
neuroimaging techniques, such as structural and functional magnetic 
resonance imaging (sMRI and fMRI, respectively), in order to identify the
neurobiological processes underlying the disorder, with the ultimate
aim of developing new diagnostic and therapeutic initiatives. There
are now many sMRI and fMRI studies in schizophrenia which implicate
a range of structural and functional brain abnormalities (Dauvermann 
et al., 2013; Lawrie and Abukmeil, 1998; Olabi et al., 2011; Wright 
et al., 2000), some of which are evident even before disease onset and 
are predictive of illness (Lawrie et al., 2008; Moorhead et al., 2013). 

The majority of structural MRI studies have employed Region of Interest
(ROI) or Voxel-based Morphometry (VBM) methods for the analysis
of neuroimaging data, to compare groups of patients and groups of
controls, and reported deficits mainly in the temporal and prefrontal 
lobes (Lawrie and Abukmeil, 1998; Meisenzahl et al., 2008), particularly 
in the superior temporal gyrus (Honea et al., 2005), the medial temporal
lobe (Honea et al., 2005; Wright et al., 2000), including the amygdala 
and hippocampal complex and the parahippocampal gyrus, as well as 
enlargement of the lateral ventricles (Shenton et al., 2001). Similar 
structural abnormalities have been detected in groups of patients in 
the early stages of schizophrenia (Kubicki et al., 2002; Steen et al., 
2006). These are less pronounced compared to the established state, 
suggesting active disease processes around the time of onset, although 
genetic factors, substance misuse, antipsychotic drug treatment and 
other factors may be partly responsible (Meisenzahl et al., 2008; Olabi 
et al., 2011). There are, similarly, replicated gray matter density changes
over time in high-risk individuals as they develop schizophrenia, again 
particularly in the prefrontal and temporal lobes (Job et al., 2005; 
Pantelis et al., 2003). Moreover, functional MRI studies have examined
differences in function and cognitive ability between schizophrenia patients
and healthy controls, reporting abnormal activation in a network of 
brain regions, particularly implicating the prefrontal cortex (Meyer- 
Lindenberg, 2010) and connectivity from it to the rest of the brain 
(Lawrie et al., 2002). 

Although the univariate methods used in these analyses
have delivered quite consistent and interesting results, they suffer
from certain limitations. ROI methods are confined to predefined
brain regions and cannot capture distributed patterns of neuroanatom- 
ical and neurophysiological abnormality across the brain. VBM and 
other approaches to computational morphometry, on the other hand, 
require brain averaging and cannot capture individual deviations from 
the norm. To address these limitations, the scientific community has turned to machine
learning in an effort to detect MRI correlates of clinical relevance
and utility. Machine learning methods have already been applied in
the analysis and interpretation of functional and structural MRI data 
(LaConte et al., 2005; Lemm et al., 2011; Pereira et al., 2009), in 'mind 
reading' paradigms (Cox and Savoy, 2002; Haynes and Rees, 2006), in 
the classification of cognitive states (Mitchell et al., 2004; Mourao- 
Miranda et al., 2005), and in lie detection approaches (Davatzikos 
et al., 2005a). More recently, classification algorithms have been applied 
to diagnose neurological and psychiatric disorders (Bray et al., 2009; 
Kloppel et al., 2011; Orru et al., 2012), such as dementia (Davatzikos 
et al., 2011; Kloppel et al., 2008a; Kloppel et al., 2008b), depression (Fu 
et al., 2008; Mourao-Miranda et al., 2011) and schizophrenia 
(Davatzikos et al., 2005b; Fan et al., 2008b; Koutsouleris et al., 2009; 
Koutsouleris et al., 2011). Multivariate pattern recognition techniques
provide the possibility of making inferences about a subject's health sta- 
tus at an individual level and, thus, are well suited for clinical decision 
making purposes. 

In this paper, we highlight the application of machine learning in the 
analysis of structural and functional MRI data in diagnosing schizophre- 
nia, particularly for making an early prediction in people at high-risk of 
developing the disorder. We first give a brief overview of machine 
learning theory and the common processing steps that almost every 
machine learning method shares in their image analysis pipelines. 
Then, we discuss the studies that have employed machine learning in 
schizophrenia research and finally, we analyze the main practical challenges
and limitations that machine learning methods suffer from, in
the context of their potential integration into routine clinical practice,
before concluding with future research directions. 

2. Methods 

The standard approach to the analysis of structural and functional 
MRI data is based on the General Linear Model (Friston et al., 1995), in
which neuroimaging data are modeled as a linear combination of explanatory
variables, potentially confounding parameters and some error. Statistical
tests are then performed on each and every voxel independently in 
order to make inferences about effects of interest at a group-level, lim- 
iting the practical value of MRI in clinical settings. Multivariate pattern 
recognition methods have been used to overcome these limitations, 
by examining multiple voxels jointly, in order to identify patterns of dif- 
ferentiation between the groups and make inferences at a single-subject 
level. 
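
The voxel-by-voxel testing described above can be illustrated with a minimal sketch. It assumes the simplest possible design, a single group regressor, for which the General Linear Model reduces to a pooled-variance two-sample t-test; the function names and the toy data are hypothetical and are not taken from any of the reviewed studies.

```python
import math
from typing import List, Sequence


def pooled_t(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample t statistic with pooled variance (a GLM with one group regressor)."""
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    ss_a = sum((v - mean_a) ** 2 for v in a)
    ss_b = sum((v - mean_b) ** 2 for v in b)
    pooled_var = (ss_a + ss_b) / (na + nb - 2)
    if pooled_var == 0:
        raise ValueError("zero variance at this voxel")
    return (mean_a - mean_b) / math.sqrt(pooled_var * (1 / na + 1 / nb))


def voxelwise_t(patients: Sequence[Sequence[float]],
                controls: Sequence[Sequence[float]]) -> List[float]:
    """Test every voxel independently; rows are subjects, columns are voxels."""
    n_voxels = len(patients[0])
    return [
        pooled_t([s[v] for s in patients], [s[v] for s in controls])
        for v in range(n_voxels)
    ]


# Toy data: three subjects per group, two voxels per subject (assumed values).
patients = [[1.0, 5.0], [2.0, 6.0], [3.0, 4.0]]
controls = [[4.0, 5.0], [5.0, 4.0], [6.0, 6.0]]
t_map = voxelwise_t(patients, controls)
```

Each entry of `t_map` is computed without reference to any other voxel, which is exactly the property that multivariate methods are meant to relax.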

2.1. Overview of machine learning 

Machine learning (ML) is a term used to describe a set of methods for
detecting patterns in data that would enable reliable future predictions.
There are two major methodological approaches: supervised and
unsupervised machine learning techniques. In supervised learning, the
goal is to find a mapping from the data instances $x_i$ to a set of desired
outputs $y_i$, given a set of labeled input-output pairs $D = \{(x_i, y_i)\}$, for
$i = 1, \dots, N$ instances. Here, $D$ is the training set, consisting of feature
vectors $x_i$ and their corresponding labels $y_i$, and $N$
is the number of training instances. If $y_i$ is a categorical or nominal
variable drawn from a finite set, for instance $y_i \in \{1, 2, \dots, C\}$, then the
problem is known as a classification problem. In its simplest form,
where $C = 2$ (and thus $y_i \in \{-1, 1\}$), this is a binary classification problem,
whereas if $C > 2$, then there is a multi-class classification problem.
On the other hand, if $y_i$ is a real-valued (continuous) variable, the problem
is known as regression. In unsupervised learning, on the other hand,
the goal is to identify an inherent structure in the data in order to assign the
given data instances $D = \{x_i\}$ to groups (clustering).

2.2. Classification pipeline 

The following steps in the image analysis pipeline are common to 
most machine learning methods: 

2.2.1. Preparation of the training set

The first step in an ML analysis is the creation of the training set. This 
procedure involves two main processes: i) feature extraction and ii) fea- 
ture selection. Feature extraction involves the transformation of the 
original data set into a form that would be meaningful for the classifier 
to process. In the context of neuroimaging, this procedure entails the ex- 
traction of feature vectors corresponding to intensity values of voxels 
from each subject's scan. Feature selection involves
selecting those features that are best at discriminating between
the classes and thus could facilitate and speed up the classification process.
Feature selection can be performed either with a dimensionality
reduction technique (such as Principal Component Analysis) or by
restricting the analysis to specific brain areas for which the research
team possesses prior knowledge about their likely involvement in the
condition under investigation. Feature extraction is an obligatory step
in the classification pipeline, but feature selection approaches are 
optional. 
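
As an illustration of these two processes, the sketch below flattens each scan into a vector of voxel intensities (feature extraction) and then keeps only the voxels inside a region of interest (feature selection by prior knowledge). The toy volumes, the mask and the function names are assumptions made for the example; real pipelines operate on registered, preprocessed images with hundreds of thousands of voxels.

```python
from typing import List, Sequence

Volume = List[List[List[float]]]  # toy 3-D scan indexed [x][y][z]


def extract_features(volume: Volume) -> List[float]:
    """Flatten a 3-D volume into one feature vector of voxel intensities."""
    return [v for plane in volume for row in plane for v in row]


def select_features(features: Sequence[float], mask: Sequence[bool]) -> List[float]:
    """Keep only the voxels flagged by a (hypothetical) region-of-interest mask."""
    if len(features) != len(mask):
        raise ValueError("mask and feature vector must have the same length")
    return [f for f, keep in zip(features, mask) if keep]


# Two toy 2x2x2 "scans" with made-up intensities.
scan_a = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
scan_b = [[[0.5, 1.5], [2.5, 3.5]], [[4.5, 5.5], [6.5, 7.5]]]

# Assumed prior knowledge: only voxels 1, 2 and 6 are of interest.
roi_mask = [False, True, True, False, False, False, True, False]

training_set = [select_features(extract_features(s), roi_mask) for s in (scan_a, scan_b)]
```

Each row of `training_set` is one subject's reduced feature vector, ready to be paired with a class label.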

2.2.2. Model training and testing 

In the model training step of the pipeline, the chosen algorithm has 
to learn the relationship between the training set and the labels associ- 
ated with it, while trying to optimize the algorithm's parameters in 
order to maximally discriminate between the groups. In the testing 
phase, the algorithm tries to predict the class label (in the case of 
classification) or the continuous variable (in the case of regression) of
previously unseen data instances. It is very important that the algorithm
generalizes well to new instances. To assess this fairly, the testing set should not include
instances of the training set, so as to avoid circularity or data overfitting.
Cross-validation techniques are a popular way to ensure this. In k-fold
cross-validation, the original data set is split into k non-overlapping
subsets and then the algorithm is trained using k - 1 subsets and the
left-out set is used in the testing phase. The procedure is repeated k 
times, so that every subgroup is used in the testing phase. 
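
The k-fold procedure can be sketched as follows. This is a minimal illustration that only generates the train/test index splits; the function name and the fixed random seed are assumptions, and the classifier that would be trained on each split is left out.

```python
import random
from typing import List, Tuple


def k_fold_indices(n: int, k: int, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return k (train, test) index pairs with non-overlapping test folds."""
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= n")
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint folds covering all indices
    splits = []
    for i, test in enumerate(folds):
        # Training indices are everything that is not in the held-out fold.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


splits = k_fold_indices(n=10, k=5)
```

Setting `k` equal to the number of subjects gives leave-one-out cross-validation, the scheme used in several of the studies reviewed below.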

2.2.3. Performance evaluation

The final step is the evaluation of classification performance of the 
method. This usually includes measures such as sensitivity, specificity 
and accuracy. Sensitivity refers to the proportion of actual positive
cases correctly identified (e.g. the proportion of schizophrenia patients
assigned to the patient class) and is computed as
TP / (TP + FN), where TP is the number of true positives and FN is the
number of false negatives. Specificity refers to the proportion of negative
cases correctly classified (e.g. healthy controls correctly identified
as being healthy) and is computed as TN / (TN + FP),
where TN is the number of true negatives and FP is the number of false
positives. Accuracy refers to the overall proportion of correct classifications
across the groups and is computed as (TP + TN) / (TP + TN + FN + FP),
or as (sensitivity + specificity) / 2 if the classes are balanced.
Permutation tests are frequently applied as well, in order to determine
the statistical significance of the classifier's performance. In these
tests, class labels are randomly reassigned across subjects a
set number of times, and the cross-validation procedure is repeated each time. By
calculating the number of times that the sensitivity and specificity for the
permuted labels are higher than the real ones, and dividing by the number
of permutations, one can obtain a p-value for the
classification accuracies.
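
A minimal sketch of these measures and of a permutation test is given below, assuming labels coded +1 (patient) and -1 (control) and both classes present. The names are hypothetical. In this toy example `score_fn` merely scores a fixed set of predictions against the shuffled labels; in a real analysis it would stand for re-running the whole cross-validation with the permuted labels. Ties are counted with `>=`, a slightly conservative choice.

```python
import random
from typing import Callable, Dict, List, Sequence


def confusion_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Sensitivity, specificity and accuracy for labels coded +1 / -1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == -1 and p == -1)
    fp = sum(1 for t, p in pairs if t == -1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == -1)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


def permutation_p_value(y_true: Sequence[int],
                        score_fn: Callable[[List[int]], float],
                        n_permutations: int = 1000,
                        seed: int = 0) -> float:
    """Fraction of label permutations whose score reaches the observed score."""
    rng = random.Random(seed)
    observed = score_fn(list(y_true))
    shuffled = list(y_true)
    count = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if score_fn(shuffled) >= observed:
            count += 1
    return count / n_permutations


y_true = [1, 1, 1, -1, -1, -1, 1, -1]
y_pred = [1, 1, -1, -1, -1, 1, 1, -1]   # assumed classifier output
metrics = confusion_metrics(y_true, y_pred)


def accuracy_against(labels: List[int]) -> float:
    # Stand-in for repeating the cross-validation with permuted labels.
    return confusion_metrics(labels, y_pred)["accuracy"]


p_value = permutation_p_value(y_true, accuracy_against, n_permutations=200)
```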

2.3. Machine learning methods explained 

ML techniques that have been applied in neuroimaging contexts include
Support Vector Machine (SVM), Support Vector Regression (SVR),
Linear Discriminant Analysis (LDA) and
Independent Component Analysis (ICA). Below, we briefly discuss the
methodology behind each method.

SVM is one of the most popular supervised machine learning
methods used in neuroimaging settings, partly because it can deal effectively
with high-dimensional data and provide good classification results.
The aim of an SVM classifier is to find a decision surface that
optimally distinguishes between classes and, based on that surface,
to assign new, previously unseen data instances to the groups. In the
training phase, the classifier computes the optimal decision surface,
expressed in the form $f(x) = w \cdot x + b$, using only a subset of the original
training set $D = \{(x_i, y_i)\}$ called the support vectors. Support vectors
are the data points that lie closest to the optimal separating hyperplane
and hence are the most difficult patterns to classify (see Fig. 1). The optimal
hyperplane is determined by maximizing the margin of separation
between the two classes (which is equal to $2/\|w\|$). Equivalently, finding
the optimal hyperplane becomes an optimization problem
in which we need to minimize $\|w\|$ subject to $y_i (x_i \cdot w + b) - 1 \geq 0$ for all $i$. The
constraint of this quadratic problem ensures that no data points
lie inside the margin.

In the testing phase, the classifier is required to predict the label $y$ of
new, previously unseen data instances by evaluating $y = \mathrm{sgn}(w \cdot x + b)$.
Where the data are not linearly separable, kernels are introduced.
Kernels are functions that allow a mapping of the original,
non-linearly separable data into a new feature space where the
data are linearly separable. Polynomial and Gaussian radial basis function
(RBF) kernels are among the most commonly used.
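
The quantities that appear in the SVM formulation can be made concrete with a short sketch. It does not train an SVM: the weight vector `w` and offset `b` are hand-picked assumptions for a toy two-dimensional problem, and the function names are hypothetical. The sketch only evaluates the decision rule, the margin width, the margin constraint and an RBF kernel.

```python
import math
from typing import Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def decision_value(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """f(x) = w . x + b"""
    return dot(w, x) + b


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """y = sgn(w . x + b), with labels coded +1 / -1."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def margin_width(w: Sequence[float]) -> float:
    """Width of the margin between the two classes, 2 / ||w||."""
    return 2.0 / math.sqrt(dot(w, w))


def satisfies_margin(w: Sequence[float], b: float, x: Sequence[float], y: int) -> bool:
    """Training constraint y (x . w + b) - 1 >= 0."""
    return y * decision_value(w, b, x) - 1 >= 0


def rbf_kernel(x: Sequence[float], z: Sequence[float], gamma: float) -> float:
    """Gaussian RBF kernel exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - c) ** 2 for a, c in zip(x, z))
    return math.exp(-gamma * sq_dist)


# Hand-picked hyperplane x1 + x2 - 3 = 0 (an assumption, not a fitted model).
w, b = [1.0, 1.0], -3.0
support_neg, support_pos = [1.0, 1.0], [2.0, 2.0]  # both lie exactly on the margin
label_new = predict(w, b, [3.0, 3.0])
```

With a kernel, the inner product `dot(x, z)` is replaced by `rbf_kernel(x, z, gamma)`, so the same machinery yields a non-linear decision surface in the original feature space.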

Fig. 1. Representation of a linear, binary SVM classifier. The optimal separating hyperplane
is the one with the largest margin of separation between the two groups and is described
by the function $f(x) = w \cdot x + b$, where $w$ is a weight vector that is normal to the hyperplane,
$b$ is an offset and $|b|/\|w\|$ is the distance from the hyperplane to the origin. Points on
the dashed lines represent the support vectors. During the training phase, the SVM classifier
computes the optimal decision function $f(x)$ and in the testing phase, this decision
boundary is applied to new data instances.

Support Vector Regression (SVR) follows the same principles as
SVM, but the goal here is to assign a data sample to a continuous
variable rather than a class. SVR aims to find a function that provides
the optimum fit between the data samples and their continuous variables,
while specifying a tolerance margin of reliable generalization.

Discriminant Function Analysis (DA) is primarily used to predict
group membership from a set of continuous variables (features). DA involves
two steps: i) evaluating the significance of discriminant functions
and of a set of predictors in discriminating the groups and ii) performing
the classification by assigning data instances to the groups of interest.
In the first step, DA computes the discriminant functions, which are
given by the equation $D = v_1 X_1 + v_2 X_2 + \dots + v_i X_i + a$, where $D$ is
the discriminant function, $v_i$ the discriminant coefficient (or weight),
$X_i$ the score of variable $i$ and $a$ a constant. The maximum
number of discriminant functions is equal to the degrees of freedom
(number of groups minus 1), or the number of variables in the analysis,
whichever is smaller. In this step, DA automatically determines an
optimal combination of variables so that the first discriminant function
provides the greatest discrimination between groups, the second
provides the second greatest and so on. Then, in the second stage, classification
can be performed. Subjects are classified into the groups in which they
had the highest classification scores. In Linear Discriminant Analysis
(LDA), the method looks for a linear combination of variables that
would best classify data samples into a predefined number of groups.
LDA can be used for both classification and feature reduction purposes.
In the training stage, LDA computes linear transformations of the features
that would provide a more accurate discrimination between the
classes $y_i$, given the training set $\{(x_i, y_i)\}$. A transformation function is
computed so that the ratio of between-class to within-class variance
is maximized (Fisher's LDA). In most cases, there is no transformation
that provides complete separation between the classes, so the goal
is to find the transformation that minimizes the overlap of the
transformed groups (see Fig. 2). Once the discriminant function is computed
and all data instances in the training set are transformed into the
new (C - 1)-dimensional subspace (where C is the number of classes), classification
of new data instances can be performed (second stage of LDA).
The discriminant function acts as a classification rule for assigning new
data instances to the groups.
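
For two classes and two features, Fisher's criterion has a closed-form solution, which the following sketch illustrates: the projection direction is proportional to the inverse within-class scatter matrix applied to the difference of the class means. The data, the function names and the midpoint threshold (which implicitly assumes equal priors and similar class spread) are assumptions made for the example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def mean(points: Sequence[Point]) -> Point:
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def scatter(points: Sequence[Point], m: Point) -> List[List[float]]:
    """2x2 scatter matrix of one class: sum of (p - m)(p - m)^T."""
    sxx = sum((p[0] - m[0]) ** 2 for p in points)
    sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in points)
    syy = sum((p[1] - m[1]) ** 2 for p in points)
    return [[sxx, sxy], [sxy, syy]]


def project(w: Point, p: Point) -> float:
    return w[0] * p[0] + w[1] * p[1]


def fisher_lda(class1: Sequence[Point], class2: Sequence[Point]) -> Tuple[Point, float]:
    """Return (w, threshold) with w proportional to S_W^{-1} (m2 - m1)."""
    m1, m2 = mean(class1), mean(class2)
    s1, s2 = scatter(class1, m1), scatter(class2, m2)
    a, b = s1[0][0] + s2[0][0], s1[0][1] + s2[0][1]
    d = s1[1][1] + s2[1][1]
    det = a * d - b * b
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    dx, dy = m2[0] - m1[0], m2[1] - m1[1]
    # Closed-form inverse of the symmetric 2x2 matrix [[a, b], [b, d]].
    w = ((d * dx - b * dy) / det, (-b * dx + a * dy) / det)
    threshold = 0.5 * (project(w, m1) + project(w, m2))
    return w, threshold


def classify(w: Point, threshold: float, p: Point) -> int:
    """Assign class 2 if the projection exceeds the threshold, else class 1."""
    return 2 if project(w, p) > threshold else 1


class1 = [(1.0, 2.0), (2.0, 1.0), (2.0, 3.0), (3.0, 2.0)]
class2 = [(6.0, 5.0), (7.0, 4.0), (7.0, 6.0), (8.0, 5.0)]
w, threshold = fisher_lda(class1, class2)
```

Projecting onto `w` reduces the two features to a single discriminant score, in line with the C - 1 dimensions available for C = 2 classes.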

Fig. 2. Representation of LDA for a two-class classification problem based on synthetic two-dimensional data representing measurements in feature 1 and feature 2. As observed,
classification is more accurate if the data are projected onto the X dimension, as opposed to the Y dimension where there is substantial overlap between the classes, as shown in the histograms.
Once the projection of data instances onto the dimension that fulfills Fisher's criterion is specified, new data instances can be classified based on a threshold (for example, if $x_i < 4$ classify as
class 1, otherwise class 2) or a specified metric (e.g. Euclidean distance from the mean of a class).

Independent Component Analysis (ICA) is a multivariate statistical
method, widely applied in problems of image and signal classification,
aiming to decompose a complex data set into independent sub-groups.
In brain imaging, ICA is primarily used as a feature extraction and
dimensionality reduction technique by decomposing a brain scan into a
set of statistically independent components, which correspond to
temporally coherent brain networks. ICA makes the assumption that the
originally measured data can be expressed as a linear combination of
some latent variables, called eigen-images, and aims to map the original
high-dimensional data into a linear subspace based on the eigen-images.
If $X = (x_1, x_2, \dots, x_n)^T$ is the original data set with $n$ data
instances and $S = (s_1, s_2, \dots, s_n)^T$ is the vector of the latent components,
then $X$ can be written as a linear combination of the form $X = AS$,
where $A$ is a matrix with elements $A = (a_1, a_2, \dots, a_n)$. In order to find
the independent components $S$, one needs to compute
$S = WX$, where $W$ is the inverse of the matrix $A$. Of note, independent
components must be non-Gaussian for ICA to be possible. For a thorough
review of this ICA approach, one can refer to Hyvarinen and Oja
(2000).
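
The mixing model and its inversion can be illustrated with a toy example. The sketch below is not an ICA estimator: it assumes that the mixing matrix A is known, which is never the case in practice, where W has to be estimated from X alone by maximizing the statistical independence of the recovered components. All values and function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x m) and b (m x p)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def inverse_2x2(m: Matrix) -> Matrix:
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    if det == 0:
        raise ValueError("mixing matrix is singular")
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


# Two toy non-Gaussian sources (rows) observed at five sample points.
S = [[1.0, -1.0, 1.0, -1.0, 1.0],
     [0.0, 0.5, 1.0, 0.5, 0.0]]

# Assumed mixing matrix, known only because this is a constructed example.
A = [[2.0, 1.0],
     [1.0, 3.0]]

X = matmul(A, S)             # observed mixtures, X = AS
W = inverse_2x2(A)           # unmixing matrix, W = A^{-1}
S_recovered = matmul(W, X)   # recovered components, S = WX
```

In a real analysis the rows of `X` would be the measured signals, and the recovered components are determined only up to their order and scale.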

All of the presented machine learning methods have been widely 
used in neuroimaging-based studies of schizophrenia, producing vari- 
able classification results (see the tables). The choice of the machine 
learning method to be used is directly dependent on the nature of the 
data set and the classification problem at hand. It is important to note 
that each machine learning method has its own intrinsic strengths 
and weaknesses. For instance, SVM is a powerful method in detecting 
complex and subtle differences between groups due to the fact that 
only support vectors affect the determination of the decision function. 
SVMs can also work efficiently with complex, non-linear data whereas 
LDA can only be applied on groups that can be separated by a linear 
combination of features. LDA is the optimal classification model when 
the distributions of the data are Gaussian (parametric method), where- 
as SVM is a non-parametric classification method and as such, more ef- 
ficient in handling data that are not regularly distributed or have an 
unknown distribution. SVM might, therefore, be more appropriate in 
real-world data sets where the distribution of the data is not always
known. On the other hand, LDA is a simpler and more straightforward
method and does not require any tuning of parameters, whereas
SVMs' performance depends on the choice of the kernel and its parameters
(Burges, 1998). SVMs can therefore be slower and have high
computational processing and memory requirements, especially when it
comes to large training data sets. Another limitation of LDA is that the
number of discriminant features it yields is upper-bounded (at most C - 1
for C classes), thus constraining the application of the method in
cases where more features are needed, and that it can only be used for
classification, not regression problems.



3. Machine learning in schizophrenia 

In the past few years, an increasing number of studies have 
employed machine learning to investigate the neuroanatomical and 
neurophysiological correlates of schizophrenia. These studies can be di- 
vided into three main categories: (i) studies that examine the diagnos- 
tic power of machine learning in distinguishing between healthy 
controls (HC) and schizophrenia patients (SCHZ), (ii) studies which ex- 
amine the potential of machine learning to make an early diagnosis of 
schizophrenia (prediction) by comparing scans at baseline of people 
at high risk (either for familial or clinical reasons) of making a transition 
to the disorder and (iii) studies which examine the performance of ma- 
chine learning in predicting progression of the disease and response to 
treatment, usually by examining the baseline scans of first-episode (FE) 
patients with a later known clinical outcome or treatment response. An 
online search of PubMed was performed in order to identify suitable papers
for inclusion, using the following search terms: (machine learning
OR pattern recognition) AND (psychosis OR schizophrenia) AND (diagnosis
OR early diagnosis OR prediction OR transition to schizophrenia
OR disease progression OR treatment response) AND/OR (MRI OR
fMRI). Twenty-seven studies met our inclusion criteria - of presenting
original data about an ML application in patients with formally diagnosed
schizophrenia - and are discussed below.
3.1. Diagnostic studies of schizophrenia

The first study to apply an sMRI-based classification method was
conducted by Davatzikos et al. (2005b), who tested the performance
of Support Vector Machine (SVM) in classifying 69 schizophrenia
patients (46 men, 23 women) and 79 matched healthy controls (41
men, 38 women), reaching an 81% classification accuracy via
leave-one-out cross-validation. The authors also tested separate classifiers for men and
women and observed similar classification results (85% accuracy
for the male and 82% for the female classifier), possibly implying
good generalizability of the MRI-based diagnostic system. In another
study by the same group, Fan et al. (2007) achieved impressive accuracies of
91.8% and 90.8% in distinguishing, respectively, the same 23
female SCHZ patients from 38 female HC and the 46 male SCHZ patients
from 41 male HC. Here, the development of an adaptive regional
feature extraction method, which automatically grouped morphological
traits of similar classification power, along with an SVM-Recursive
Feature Elimination method, which selected features with the highest
discriminatory power, may account for what still remains one
of the best diagnostic performances observed in chronic schizophrenia
diagnostic studies published to date. The researchers achieved this diagnostic
performance by using just 39 features for the female and 44
features for the male classifier. This diagnostic result was,
however, obtained from a feature set that might be specific to this sample
group, and the result may well not generalize to other data samples.
In the context of examining family members of schizophrenia patients, only one
study has to date investigated the role of genetic factors in the disorder
using MRI-based machine learning (Fan et al., 2008b). Fan et al.
(2008b) observed that unaffected family members share similar phenotypic
patterns with their affected relatives. Although these
initial results are encouraging, longitudinal studies are
essential in determining whether this endophenotypic pattern is present
before disease onset and how it relates (if so) to transition to
schizophrenia in unaffected relatives.

Evaluating a classifier on a totally independent cohort is of course 
the ideal way of examining the generalizability and robustness of the 
classifier (Nieuwenhuis et al., 2012). Unfortunately, the consequent 
need for large data sets makes this endeavor very difficult. In an impressive
two-stage study, Kawasaki et al. (2007) observed an 80% classification
accuracy using a partial least squares model that was trained on
30 male HC and 30 male SCHZ patients and tested on a new, independent
cohort of 16 male controls and 16 male SCHZ patients. In a particularly
large classification study employing an independent test set, diagnostic
accuracy was, however, only about 70% (Nieuwenhuis et al., 2012) when
testing an SVM classifier developed on 239 participants (128 SCHZ) on a
completely independent sample of 277 subjects (155 SCHZ). The use of
a larger validation set may partly account for the lower diagnostic accuracy,
if we take into account the possible inclusion of more variable
schizophrenia phenotypes in this larger group. 

Several studies have, alternatively, employed fMRI in an attempt to 
establish the diagnosis in groups of people with schizophrenia and 
controls (Table 2). These studies have included various cognitive 
tasks (Costafreda et al., 2011; Yoon et al., 2012) or resting-state fMRI 
(Calhoun et al., 2006; Shen et al., 2010; Venkataraman et al., 2012), 
in which the subject is simply instructed to remain still during scan- 
ning, not to think of anything in particular and not to fall asleep. In 
recent fMRI studies, resting-state paradigms are often preferred to 
task-related approaches, as they are free from task-related confounds 
and easier for patient populations to perform, although they do have 
limitations (Morcom and Fletcher, 2007). The diagnostic accuracy of 
resting-state fMRI-based classification methods ranged from 75%
(Jafri and Calhoun, 2006; Venkataraman et al., 2012) to 92%
(Costafreda et al., 2011; Shen et al., 2010), suggesting that resting-state
fMRI has the potential to be useful in clinical practice. Results
should be interpreted with caution, however, since the sample sizes
in most cases (Anderson et al., 2010; Shen et al., 2010) are very
small and potentially introduce a bias to the classification (Demirci
and Calhoun, 2009).

Table 1

Studies employing machine learning and structural MRI to distinguish patients with schizophrenia from healthy controls.

| Author | Sample (N, diagnostic classification) | ML methods and scanner field strength | Classifier's performance (accuracy %) |
|---|---|---|---|
| Davatzikos et al. (2005b) | HC = 79, SCHZ = 69; DSM-IV | SVM; 1.5 T | 81.1 |
| Fan et al. (2007) | HC1 = 38 (females), SCH1 = 23 (females); HC2 = 41 (males), SCH2 = 46 (males); DSM-IV | SVM-RFE; 1.5 T | HC1 vs SCH1 = 91.8; HC2 vs SCH2 = 90.8 |
| Kawasaki et al. (2007) | Train set: HC = 30, SCHZ = 30 (males); Test set: HC = 16, SCHZ = 16 (males); DSM-IV | DA & MLM; 1.5 T | 80 |
| Yoon et al. (2007) | HC = 52, SCHZ = 53; DSM-IV | SVM; 1.5 T | 90 |
| Sun et al. (2009) | HC = 36, ROS = 36; DSM-IV | SMLR; 1.5 T | 86.1 |
| Karageorgiou et al. (2011) | HC = 47, ROS = 28; SCID-I for DSM-IV | sMRI & neuropsychological data, PCA-LDA; 3 T | 92 |
| Kasparek et al. (2011) | HC = 39, FE = 39; ICD-10 | MLDA; 1.5 T | 72 |
| Greenstein et al. (2012) | HC = 99, COS = 98; DSM-IIIR/IV | RF; 1.5 T | 73.7 |
| Nieuwenhuis et al. (2012) | Train set: HC = 111, SCHZ = 128; Test set: HC = 122, SCHZ = 155; DSM-IV | SVM; 1.5 T | 70.4 |
| Zanetti et al. (2013) | HC = 62, FE = 62; DSM-IV | SVM; 1.5 T | HC vs FE = 73.4 |
| Borgwardt et al. (2012) | HC = 22, FE = 23, ARMS-T = 16; DSM-IIIR | ensemble SVM; 1.5 T | HC vs FE = 86.7; HC vs ARMS-T = 80.7; FE vs ARMS-T = 80 |

Abbreviations: ARMS-T, at-risk mental state with transition to schizophrenia; COS, childhood-onset schizophrenia; DA, discriminant analysis; DSM-IV, Diagnostic and Statistical Manual of Mental
Disorders Fourth Edition; DSM-IIIR, Diagnostic and Statistical Manual of Mental Disorders Third Edition Revised; FE, first-episode schizophrenia patients; HC, healthy controls; ICD-10, the
International Statistical Classification of Diseases and Related Health Problems; LDA, linear discriminant analysis; MLDA, maximum-uncertainty linear discrimination analysis; MLM, multivariate linear
model; PCA, principal components analysis; RF, random forests; ROS, recent-onset schizophrenia; SCHZ, schizophrenia patients; SCID-I, Structured Clinical Interview; SMLR, sparse multinomial
logistic regression; SVM, Support Vector Machine; SVR, Support Vector Regression; SVM-RFE, Support Vector Machine with Recursive Feature Elimination.

Adequate sample size is an important consideration for the robustness and
reliability of a proposed classification system. Classification models based on
small samples tend to report higher diagnostic performance (Anderson et al., 2010;
Fan et al., 2007; Kawasaki et al., 2007; Sun et al., 2009; Yang et al., 2010;
Yoon et al., 2007), whereas in studies evaluating larger samples, which possibly
include a wider range of phenotypic manifestations of schizophrenia, classification
accuracy tends to be lower (Greenstein et al., 2012; Nieuwenhuis et al., 2012;
Zanetti et al., 2013). Differences in the image analysis and classification
pipelines might also partly explain this variation in findings. The introduction
of refined feature selection methods can boost a classifier's performance, as was
observed in Fan et al. (2007) compared to a previous study by the same group
(Davatzikos et al., 2005b). The choice of machine learning method is another
crucial factor in the performance of the diagnostic model. Notably, SVMs tend to
provide better classification results (Pereira et al., 2009) (see Table 1) than
other pattern recognition methods, although a direct comparison between the machine
learning methods used in the presented studies and classification performance
cannot be made, owing to other differences in the imaging and clinical
characteristics of the samples used.
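
To make the sample-size caveat concrete, the following minimal Python sketch computes an
approximate 95% Wilson score interval around a reported accuracy. It treats the accuracy as
a simple proportion over all scanned subjects, which is an assumption made purely for
illustration (cross-validated predictions are not independent trials, so the real uncertainty
may be larger). The function name `wilson_interval` is ours; the two accuracies and sample
sizes are those listed in Table 2.

```python
import math


def wilson_interval(accuracy, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion observed on n cases."""
    denom = 1 + z ** 2 / n
    centre = (accuracy + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(accuracy * (1 - accuracy) / n + z ** 2 / (4 * n ** 2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# (reported accuracy, total number of subjects) as listed in Table 2.
studies = {
    "Anderson et al. (2010)": (0.85, 20),
    "Castro et al. (2011)": (0.95, 106),
}

for name, (acc, n) in studies.items():
    lo, hi = wilson_interval(acc, n)
    print(f"{name}: accuracy {acc:.2f}, n = {n}, 95% CI {lo:.2f} to {hi:.2f}")
```

Under these assumptions, an accuracy of 85% on 20 subjects is compatible with a true
accuracy anywhere from about 0.64 to 0.95, while the interval for the larger sample is
considerably narrower.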

The clinical characteristics of patients may play a significant role in the
observed fluctuations in accuracy across diagnostic studies (Greenstein et al.,
2012; Zanetti et al., 2013). Machine learning studies of FE schizophrenia seem to
deliver worse diagnostic performance (Kasparek et al., 2011; Yoon et al., 2012;
Zanetti et al., 2013) than studies of established schizophrenia (see Tables 1
and 2), possibly due to the less pronounced brain alterations in the former group,
although diagnostic accuracies can be as high as 92% (see Table 1). It is known
that the first-episode stage of schizophrenia is characterized by less marked brain
changes than chronic schizophrenia, and this could partly account for the accuracy
fluctuations observed (see Tables 1 and 2). In addition, comorbid disorders and
patient recruitment procedures may also affect the sensitivity of the classifier
in detecting disease-specific patterns. For instance, Zanetti et al. (2013)
recruited a population-based sample of FE patients with comorbid substance use
disorders, using epidemiological methods in order to ensure representativeness of
'real-world' individual cases, and observed just 73.4% accuracy in classifying them
against HCs. In childhood-onset schizophrenia (COS), only one study examined the
neuroanatomical correlates in 98 COS subjects (all below the age of 13) versus
99 HCs (Greenstein et al., 2012) and observed moderate diagnostic accuracy (73.7%),
possibly due to the young age of the patients and the fact that their
unconsolidated brain structure may hinder the detection of clear, concrete brain
patterns that would facilitate classification. Factors associated with
anti-psychotic drug treatment are also a serious consideration, because medication
may have an effect on brain structure (Pantelis et al., 2003), possibly even to the
point that the sensitivity of the classifier to detect morphological abnormalities
specifically associated with schizophrenia diagnosis is compromised.

3.2. Early diagnostic studies of schizophrenia 

Several recent neuroimaging studies have shown structural and 
functional abnormalities in subjects at high-risk of developing schizo- 
phrenia compared to healthy controls as well as compared to established 
patients (Lawrie et al., 2008; Mechelli et al., 2011; Smieskova et al., 



Table 2

Studies employing machine learning methods and functional MRI in diagnosing schizophrenia.

| Author | Sample (N, diagnostic classification, fMRI paradigm) | ML methods and scanner field strength | Classifier's performance (accuracy %) |
|---|---|---|---|
| Jafri and Calhoun (2006) | HC = 31, SCHZ = 38; DSM-IV; resting-state paradigm | ICA & NN; 3 T | 76 |
| Calhoun et al. (2008) | HC = 26, SCHZ = 21; DSM-IV; AOD task | ICA; 1.5 T | SCHZ vs N-SCHZ: sensitivity = 92, specificity = 98; HC vs N-HC: sensitivity = 95, specificity = 88 |
| Shen et al. (2010) | HC = 20, SCHZ = 32; DSM-IV; resting-state paradigm | Unsupervised classifier based on C-means; 1.5 T | 92.3 |
| Yang et al. (2010) | HC = 20, SCHZ = 20; DSM-IV; AOD task | fMRI & genetic data SVM; 3 T | 87 |
| Anderson et al. (2010) | HC = 6, SCHZ = 14; DSM-IV; resting-state paradigm | ICA & RF; 3 T | 85 |
| Castro et al. (2011) | HC = 54, SCHZ = 52; DSM-IV; AOD task | ICA & composite kernels with RFE; 3 T | 95 |
| Costafreda et al. (2011) | HC = 40, SCHZ = 32; DSM-IV; verbal fluency task | SVM; 1.5 T | SCHZ vs HC: 92 |
| Fan et al. (2011) | HC = 31, SCHZ = 31; DSM-IV; resting-state paradigm | ICA & SVM; 1.5 T | 85.5 |
| Venkataraman et al. (2012) | HC = 18, SCHZ = 18; DSM-IV; resting-state paradigm | RF; 3 T | 75 |
| Yoon et al. (2012) | HC = 51, FE = 51; DSM-IV; cognitive control task | LDA; 1.5 T | 61.8 |

Abbreviations: AOD, auditory oddball discrimination; BD, bipolar disorder; DSM-IV, Diagnostic and Statistical Manual of Mental Disorder Fourth Edition; FE, first-episode schizophrenia 
patients; HC, healthy controls; ICA, independent component analysis; LDA, linear discriminant analysis; NN, neural networks; N-BD, non-bipolar subjects; N-HC, non-healthy controls; 
N-SCHZ, non-schizophrenia subjects; RF, random forests; SCHZ, schizophrenia patients; SVM, Support Vector Machine. 
2010). To date, there are no biological markers for the identification of 
emerging psychosis, which is currently identified by clinical symptom- 
atology. The early identification of those high-risk individuals who are 
most likely to develop psychosis is of high potential clinical value, as 
early intervention and treatment planning could alleviate symptom
burden or even prevent disease onset (Marshall and Lockwood, 2006;
Riecher-Rossler et al., 2006). Job et al. (2005) were the first to assess
the predictive value of gray matter reductions in genetic high-risk
subjects regarding the possible transition to schizophrenia, but they
used univariate analysis methods, with their known limitations. More
recently, machine learning has been applied to make an early diagnosis of
schizophrenia and even to predict disease transition at the individual level
(see Table 3), by identifying the neuroanatomical correlates of vulnerability
to psychosis in individuals considered at high risk of developing the
disorder, mainly on clinical grounds.

Koutsouleris et al. (2009) were the first to apply multivariate pattern
recognition to evaluate individual vulnerability to psychosis and predict
disease onset. In their work, an SVM classifier was built upon structural
MRI data of individuals in an early (ARMS-E, n = 20) or late at-risk mental
state of psychosis (ARMS-L, n = 25) and a group of matched healthy
controls (HC1, n = 25). The performance of the classifier was validated
by distinguishing sMRI data derived from baseline scans of individuals
with subsequent transition to schizophrenia (ARMS-T, n = 15), those
who did not make the transition (ARMS-NT, n = 18) and matched
healthy controls (HC2, n = 17). Three-group and pairwise classifiers
were constructed, all achieving classification performance above 80%
(with the exception of the binary classifier HC1 vs ARMS-L, at 78%). Most
critically in terms of clinical utility, the ARMS-T vs ARMS-NT
pairwise classifier achieved an accuracy of 82%, suggesting the potential
of an MRI-based system in predicting transition to schizophrenia. In a
follow-up study, Koutsouleris et al. (2011) emphasized the predictive
potential of SVMs in classifying an independent cohort of 22 HC,
16 ARMS-T and 21 ARMS-NT subjects. Here the authors constructed a
robust classification method based on SVM ensemble classifiers, with
feature selection, model learning and predictive ensemble learning
wrapped in a nested cross-validation framework. The critical
ARMS-T vs ARMS-NT pairwise classifier showed slightly improved
classification results compared to that of Koutsouleris et al. (2009), whereas
diagnostic performance was lower in the pairwise HC vs ARMS-NT
classifier (66.9% accuracy as opposed to 86% in Koutsouleris et al. (2009)),
possibly due to greater heterogeneity in the control sample.
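
The logic of a nested cross-validation framework can be sketched in a few lines of Python.
This is a minimal illustration under stated assumptions, not the pipeline of Koutsouleris
et al. (2011): a nearest-centroid rule stands in for the SVM ensemble, feature ranking by
difference of class means stands in for their feature selection, the data are synthetic,
and folds are neither shuffled nor stratified for brevity. All function names are
hypothetical. The point it shows is that both the feature ranking and the choice of how many
features to keep are made using training subjects only, so the held-out fold is scored once
by a model that has never seen it.

```python
import random
import statistics


def k_folds(indices, k):
    """Split a list of indices into k disjoint folds of roughly equal size."""
    return [indices[i::k] for i in range(k)]


def rank_features(X, y, rows):
    """Rank features by absolute difference of class means, using `rows` only."""
    scores = []
    for j in range(len(X[0])):
        m0 = statistics.fmean(X[i][j] for i in rows if y[i] == 0)
        m1 = statistics.fmean(X[i][j] for i in rows if y[i] == 1)
        scores.append(abs(m1 - m0))
    return sorted(range(len(scores)), key=lambda j: -scores[j])


def accuracy(X, y, train, test, n_keep):
    """Select features and fit a nearest-centroid rule on `train`; score on `test`."""
    feats = rank_features(X, y, train)[:n_keep]
    centroids = {
        c: [statistics.fmean(X[i][j] for i in train if y[i] == c) for j in feats]
        for c in (0, 1)
    }
    correct = 0
    for i in test:
        dist = {
            c: sum((X[i][j] - m) ** 2 for j, m in zip(feats, centroids[c]))
            for c in (0, 1)
        }
        correct += min(dist, key=dist.get) == y[i]
    return correct / len(test)


def nested_cv(X, y, outer_k=5, inner_k=4, grid=(1, 2, 5)):
    """Return the mean outer-fold accuracy of a nested cross-validation."""
    idx = list(range(len(y)))
    outer_scores = []
    for held_out in k_folds(idx, outer_k):
        outer_train = [i for i in idx if i not in held_out]

        # Inner loop: pick the number of features using outer-training subjects only.
        def inner_score(n_keep):
            return statistics.fmean(
                accuracy(X, y, [i for i in outer_train if i not in fold], fold, n_keep)
                for fold in k_folds(outer_train, inner_k)
            )

        best = max(grid, key=inner_score)
        # Outer loop: the held-out fold is scored once, with the chosen setting.
        outer_scores.append(accuracy(X, y, outer_train, held_out, best))
    return statistics.fmean(outer_scores)


# Synthetic data: 40 "subjects", 20 features; only feature 0 carries a
# (hypothetical) group difference, the rest are noise.
random.seed(0)
y = [i % 2 for i in range(40)]
X = [[random.gauss(1.5 * y[i] if j == 0 else 0.0, 1.0) for j in range(20)]
     for i in range(40)]

print(f"nested cross-validation accuracy: {nested_cv(X, y):.2f}")
```

Performing the feature ranking once on the full data set before splitting would let
information from the test subjects leak into the model, which is the kind of optimistic
bias that nesting is designed to avoid.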

Despite the fact that neuroanatomical pattern classification methods
provide very encouraging results in the context of predicting disease
transition, there is some way to go before their clinical utility is
demonstrated. The small sample size in these studies limits the
statistical power of the MRI-based system proposed, so replication of the
results in larger data sets is crucial. Another consideration is that the
at-risk mental state samples in those studies involved symptomatic,
help-seeking individuals (Koutsouleris et al., 2011), and it is therefore
unclear whether these classification results would generalize to asymptomatic
high-risk groups as well.

3.3. Predicting disease progression and treatment response 

Prediction of disease progression is also of interest and potential 
clinical utility in established cases of schizophrenia, with a view to 
establishing the prognostic context and/or therapeutic responsivity 
of the psychosis. Based on neuroanatomical pattern classification 
methods, studies reported poor to modest diagnostic performance 
(Table 3) in predicting the outcome of psychosis in FE schizophrenia pa- 
tients at baseline. In this context, Mourao-Miranda et al. (2012) used a 
linear SVM to predict clinical outcome from baseline sMRI scans of 
100 FE psychosis individuals, who at 6-year follow-up were classified 
as having a continuous, episodic or intermediate course and a group of 
91 matched HCs. Although classification accuracy was less than 75% in 
all contrasts (see Table 3), this result serves as a promising starting 
point in predicting subsequent course type at the individual level. In an- 
other study, Zanetti et al. (2013) failed to predict 1-year outcome of FE 
schizophrenia patients. Despite the fact that the authors presented a ro- 
bust method for feature generation and feature selection, their SVM 



Table 3

Studies using machine learning to predict transition, progression and treatment response in schizophrenia.

| Author | Sample (N, diagnostic classification) | ML methods and scanner field strength | Classifier's performance (accuracy %) |
|---|---|---|---|
| Koutsouleris et al. (2009) | HC1 = 25, HC2 = 17; ARMS-E = 20, ARMS-L = 25, ARMS-T = 15, ARMS-NT = 18; at inclusion: DSM-IV; at follow-up: ICD-10 | Structural MRI SVM; 1.5 T | HC1 vs ARMS-E vs ARMS-L = 81; HC2 vs ARMS-T vs ARMS-NT = 82 |
| Khodayari-Rostamabad et al. (2010) | Train set: SCHZ = 23 (R = 12, NR = 11); test set: SCHZ = 14; at inclusion: DSM-IV; post-treatment evaluation: PANSS | EEG kernel PLSR | R vs NR = 85 |
| Koutsouleris et al. (2010) | HC = 28, ARMS = 25 (ARMS-T = 12, ARMS-NT = 13); at inclusion: DSM-IV; at follow-up: ICD | Structural MRI SVR; 1.5 T | HC vs ARMS: r = 0.83; HC vs ARMS-T vs ARMS-NT: r = 0.83 |
| Koutsouleris et al. (2011) | HC = 22, ARMS-T = 16, ARMS-NT = 21; at inclusion: APS, BLIPS; at follow-up: classification criteria by Yung et al. (1998) | Structural MRI ensemble SVM; 1.5 T | HC vs ARMS-T = 92.3; HC vs ARMS-NT = 66.9; ARMS-T vs ARMS-NT = 84.2 |
| Mourao-Miranda et al. (2012) | HC = 28, EP-PS = 28, CON-PS = 28, INT-PS = 32; at inclusion: ICD-10; at follow-up: WHO Life Chart | Structural MRI SVM; 1.5 T | EP-PS vs CON-PS = 70; CON-PS vs HC = 67; EP-PS vs HC = 54 |
| Zanetti et al. (2013) | R-FE = 15, NRsub-FE = 21; at inclusion: DSM-IV (SCID); at follow-up: DSM-IV | Structural MRI SVM; 1.5 T | R-FE vs NRsub-FE = 58.3; HCsub vs NR-FE = 64.3 |

Abbreviations: ARMS, at-risk mental state; ARMS-E, at-risk mental state early; ARMS-L, at-risk mental state late; ARMS-T, at-risk mental state with transition to schizophrenia; ARMS-NT, 
at-risk mental state without transition to schizophrenia; APS, Attenuated Psychotic Symptoms; BLIPS, brief limited intermittent psychotic symptoms; CON-PS, continuous psychotic; DSM-IV, 
Diagnostic and Statistical Manual of Mental Disorder Fourth Edition; EP-PS, episodic psychotic; HC, healthy controls; ICD-10, the International Statistical Classification of Disease and 
Related Health Problems; INT-PS, intermediate psychotic; NR, non-responders; NRsub-FE, subgroup of non-remittent first-episodes; PANSS, positive 
and negative syndrome scale; PLSR, partial least squares regression; R, responders; R-FE, remittent first-episodes; SCHZ, schizophrenia patients; SCID, Structured Clinical Interview; 
SVM, Support Vector Machine; SVR, Support Vector Regression; WHO, World Health Organization. 
classifier (based on the method proposed in Fan et al. (2007)) achieved
a 58.3% accuracy in predicting clinical outcome of 15 FE patients with a
subsequent remitting course versus 21 first-episode patients with a subsequent
non-remitting course. Differences in data samples (and/or data sample
selection procedures) and in the duration of follow-up might
partly explain the accuracy discrepancies observed between the two
studies.

A key determinant of prognosis in psychosis is diagnosis, both be- 
cause schizophrenia tends to have a worse outcome than bipolar disor- 
der, and because these conditions tend to respond differently to 
treatments. Early studies have shown the possibility of distinguishing
group activation patterns on fMRI in schizophrenia and bipolar disorder
(McIntosh et al., 2008), but little SVM work has thus far been done in
this vein, especially at first presentation when it might be most valuable.
In the context of predicting response to treatment in schizophrenia,
we are aware of only one study that has thus far employed machine
learning. Khodayari-Rostamabad et al. (2010) used kernel partial least
squares regression to predict response to clozapine in chronic
schizophrenia subjects, based on pre-treatment electroencephalography
(EEG) data, achieving 85% accuracy in distinguishing responders
from non-responders to the drug.

The really useful thing to do, for clinicians, patients, their families, 
and society more widely, would be to determine likely therapeutic re- 
sponse to different treatments in individual patients to facilitate the 
timely and effective application of particular treatments to patients 
who need them. In someone with psychotic symptoms, in early inter- 
vention or general adult services, it would be very useful to know, for 
example, who needs treatment because spontaneous resolution is unlikely
or would take too long, or who needs ongoing treatment to
avoid relapse and/or optimize day-to-day function. It would also be
helpful to be able to identify, at an early stage in treatment, those who
are likely to be treatment-resistant and will need treatments like clozapine
or intensive rehabilitation. These goals remain aspirational at
the moment, and may well require the inclusion of multiple data 
sources for successful implementation. 

4. Discussion 

In this review, we have presented an overview of machine learning 
methods in clinical studies and a detailed consideration of studies 
employing them in schizophrenia research. Studies published so far 
demonstrate promising leads for the development of neuroimaging ma- 
chine learning-based tools that could assist in establishing the diagnosis 
and prognosis of schizophrenia and therefore be useful in clinical prac- 
tice. Machine learning methods are advantageous compared to stan- 
dard univariate statistical methods, in that they have the potential to 
make inferences about effects of interest at a single-subject level and 
can detect subtle and widespread neuroanatomical and functional dif- 
ferences that span over large networks of brain regions, by virtue of 
their multivariate nature. 

The development of an MRI-based machine learning system could well
aid in the identification of objective biological markers for schizophrenia,
and could thus help overcome the subjectivity in traditional clinical
assessments. There are, however, significant hurdles to be overcome before
the integration of machine learning into clinical practice is possible. The
classifiers' performance is a key element for the potential integration of
machine learning into clinical decision making. As a general observation,
diagnostic classification performance in psychiatry may not supersede
clinical expertise in the foreseeable future, no matter the techniques
employed, since training a classifier requires prior knowledge of a
subject's clinical status (Orru et al., 2012). Where imaging and machine
learning could seriously impact upon clinical practice is where future
diagnosis, outcome and treatment response are difficult to predict. The
identification of high-risk individuals who are likely to convert to
schizophrenia is of high clinical value as a means to inform early treatment
strategies that could lead to better outcomes for patients. It is, however,
evident from the early diagnosis studies thus far (see Tables 1-3) that
classification accuracy in the early detection of schizophrenia and in
predicting clinical course is not as high as in diagnostic schemes. This is
probably because, when established patient groups are distinguished from
controls, neuroanatomical and functional patterns of differentiation are more
clearly and strongly established than among subjects of the same group who do
or do not go on to show an outcome of interest; the latter therefore present
a more difficult classification problem.

It should be borne in mind that a classifier with high sensitivity and
high specificity is desirable, and that overall accuracy is important, but
the relative value of sensitivity and specificity could have different
implications for patients' clinical management in different clinical
scenarios, depending on the availability of treatment and the seriousness
and frequency of adverse effects. Moreover, for an individual patient, high
positive/negative predictive power is the most critical consideration
(Lawrie et al., 2011). Furthermore, classification performance is primarily
affected by the sample size. The limited number and nature of patient
populations in SVM neuroimaging-based studies mean that these encouraging
early results may not generalize well to other patient groups. Recruiting
patients for research studies can be difficult and patients with co-morbid
conditions are often excluded, resulting in a limited representation of the
various phenotypes across the spectrum of schizophrenia. Despite the fact
that several machine learning methods can deal effectively with small sample
sizes (Pereira et al., 2009), a limited number of data samples can cause
model overfitting, resulting in poor generalization of the method to
independent data sets. In such cases, cross-validation frameworks are often
employed to partition the original data set. However, cross-validation
schemes should be performed and interpreted with caution, because there is a
serious danger of biasing the classifier's performance, especially in cases
where data samples used for training or model selection are also present in
the testing set. As a general rule, the greater the complexity of a method,
the higher the risk of overfitting the data (Mourao-Miranda et al., 2012).
Ideally, data for validation should be derived from cohorts completely
independent of the training population, as is the case in a few exemplary
studies thus far (Kawasaki et al., 2007; Nieuwenhuis et al., 2012), in order
to ensure the robustness and reliability of the system.
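
The remark on predictive power can be illustrated with a short worked example. The Python
sketch below applies Bayes' rule to obtain positive and negative predictive values from
sensitivity, specificity and the proportion of true cases among those tested. The
sensitivity and specificity are the figures listed for Calhoun et al. (2008) in Table 2; the
three prevalence values are hypothetical clinical settings chosen for illustration, not
figures from any study, and the function name is ours.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) from Bayes' rule for a given proportion of true cases."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    return true_pos / (true_pos + false_pos), true_neg / (true_neg + false_neg)


# Hypothetical settings: a balanced research sample, a specialist clinic,
# and something closer to a general-population screen.
for prevalence in (0.5, 0.1, 0.01):
    ppv, npv = predictive_values(0.92, 0.98, prevalence)
    print(f"prevalence {prevalence:.2f}: PPV = {ppv:.2f}, NPV = {npv:.2f}")
```

With these inputs the positive predictive value is about 0.98 in a balanced sample but
falls to about 0.32 when only 1% of those tested are true cases, which is why accuracy
obtained in balanced case-control samples says little by itself about the value of a
positive result for an individual patient.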

The need for large data sets could be addressed by pooling data
from multiple research centers (Mechelli et al., 2011). The existence of
a well-validated training dataset to be shared between neuroimaging
centers is likely to be of importance for standardizing classification
accuracy across laboratories. In addition, future multi-site studies could
make it possible to encompass more heterogeneous clinical populations,
demonstrating a range of clinical manifestations of a disorder
(Borgwardt and Fusar-Poli, 2012), for example subjects with various
transition rates to psychosis or subjects of lower diagnostic certainty,
and could thus mirror everyday psychiatric practice more realistically.
Data sharing among research centers faces its own difficulties, however.
Different scanners, imaging parameters and protocols result in varying
image intensity and susceptibility profiles that will require careful
consideration and compatibility solutions. One promising approach is to
generate metrics from individual scans that can then be compared to
reference data sets (Tijms et al., 2011).

It is a priority that future studies also address the challenge and
opportunity of fusing neuroimaging data from various imaging modalities
with genetic and clinical information, all of which seem likely to interact
in determining the development and outcome of schizophrenia (Lawrie
et al., 2011). It would be reasonable to assume that the introduction of
neurocognitive and other clinical measures could enhance the diagnostic
power of the classifier. Just as a clinician takes a detailed report
of symptoms and other clinical measures to diagnose a patient with
schizophrenia, so might the integration of symptom severity measures
and other neurocognitive scores, along with MRI scans, aid the
classification process. Early studies have already shown that classification
performance might well be improved in this way (Sui et al., 2012; Yang et al.,
2010). Karageorgiou et al. (2011), for example, observed a higher accuracy
(92%) in classifying recent-onset schizophrenia when structural MRI data and
neuropsychological (NP) variables were combined than when employing either
quantitative measure alone (86.7% when only NP data were used and 70.7% with
sMRI data alone). Other neuroimaging technologies such as arterial spin
labeling (ASL) perfusion MRI and diffusion tensor imaging (DTI) have shown
very promising leads in unraveling the neurobiological substrate of several
psychiatric and neurological disorders (Pinkham et al., 2011; Sussmann et al.,
2009; Van Essen et al., 2012), and might well be combined with
MRI methods in schizophrenia research. The interpretability of such
data is not, however, necessarily straightforward and, as a general
rule, each additional diagnostic variable increases sensitivity at the
expense of specificity. It is overdue, though, that combined features, such
as symptoms, duration of illness, genomics and proteomics, along with
various brain imaging modalities, are incorporated into imaging and
other evaluations in clinical research studies, with the aim of making
more reliable and objective judgements about the diagnosis of
schizophrenia and classifying patients into more homogeneous subgroups
(Lawrie et al., 2011).
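
One way to read the rule of thumb about sensitivity and specificity is in terms of a simple
"positive if any measure is positive" decision rule. The Python sketch below works through
that case under two assumptions that are ours rather than the source's: the measures err
independently of one another, and each has the same hypothetical sensitivity (80%) and
specificity (90%). A multivariate classifier learns its own combination of inputs and need
not behave this way, so the example illustrates the trade-off rather than predicting how
any fused classifier will perform.

```python
def combine_any_positive(tests):
    """Combine (sensitivity, specificity) pairs under an 'any test positive' rule,
    assuming the tests err independently of one another."""
    miss, spec = 1.0, 1.0
    for sensitivity, specificity in tests:
        miss *= 1 - sensitivity  # chance that every test misses a true case
        spec *= specificity      # chance that every test clears a true non-case
    return 1 - miss, spec


# Hypothetical measures, each with 80% sensitivity and 90% specificity.
for k in (1, 2, 3):
    sens, spec = combine_any_positive([(0.80, 0.90)] * k)
    print(f"{k} measure(s): sensitivity = {sens:.3f}, specificity = {spec:.3f}")
```

Going from one to three such measures raises sensitivity from 0.80 to 0.992 while
specificity falls from 0.90 to 0.729.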

Equally important, future studies should test the efficacy of machine 
learning in making a diagnosis of psychiatric disorders apart from 
schizophrenia, such as bipolar disorder, borderline personality disorder, 
depression and autism. Initial studies have already used machine learn- 
ing to diagnose schizophrenia and bipolar disorder versus HC subjects 
(or in a one-vs-all rationale), providing very encouraging results 
(Calhoun et al., 2008; Costafreda et al., 2011). However, replication of 
these early findings in studies that include larger samples and more 
cases across a putative psychosis spectrum is necessary in order to iden- 
tify patterns that differentiate between these psychiatric disorders. 

From a methodological point of view, novel methods for feature se- 
lection and decision making of the classifiers could be introduced in 
order to improve diagnostic power in schizophrenia studies. For exam- 
ple, ensemble learning methods could be introduced in order to improve 
the generalization ability of a classifier. Ensemble classifiers can achieve 
better predictive performance than single classifiers, by combining mul- 
tiple weak learning models that decide upon the classification of a new 
instance through majority voting (Polikar, 2006). Some well-known en- 
semble learning methods, such as bagging and random subspace 
methods have already been used in neuroimaging settings to identify bi- 
ological markers for prodromal Alzheimer's disease (Fan et al., 2008a; 
Liu et al., 2012), reporting excellent diagnostic results. Ensemble learn- 
ing could be a useful approach in data fusion studies as well, where a sin- 
gle classifier could be built and trained for each imaging modality and/or 
clinical measures (such as neurocognitive measures) separately and 
outputs from each classifier could be combined to classify new in- 
stances. An example of this approach is the study of Yang et al. (2010), 
who developed SVM-based ensemble classifiers of genetic and fMRI 
data and combined them to a single module that decided upon classifi- 
cation of testing samples via majority voting, achieving better diagnostic 
accuracy than either SVM ensembles alone (87% for the combined mod- 
ule, 74% for the genetic data classifier and 83% for the fMRI classifier). 
Future studies could also address the problem of 'tuning' a machine
learning method to fit neuroimaging settings. Refinements of
the SVM method, for example, already exist. SVM-Recursive Feature
Elimination (SVM-RFE), a very popular method that performs feature
selection during training by recursively removing the lowest-ranked features
and has already been successfully employed in cancer classification (Guyon
et al., 2002), and SVM-Sequential Minimal Optimization (SVM-SMO),
which facilitates and speeds up the classifier's training, are methods
yet to be validated for their efficacy in neuroimaging settings. Finally,
probabilistic machine learning might also be a promising tool in
neuroimaging-based schizophrenia research. More specifically, probabilistic
machine learning can be used to quantify the degree of uncertainty in
a prediction and could thus be applied in the context of predicting
transition to psychosis or future clinical outcome, indicating for example
a percentage of confidence for classification into one group or another
(e.g. a 75% risk of transition to schizophrenia and a 25% chance of not
making a transition).
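
The majority-voting step used to combine per-modality classifiers can be sketched as
follows. This is a minimal Python illustration, not the implementation of Yang et al.
(2010): the three classifier outputs are hypothetical, and the function name is ours. The
function also returns the share of votes won by the chosen label; this is only a crude
indication of agreement between classifiers and should not be mistaken for the calibrated
probability that a probabilistic method would provide.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the winning label and the fraction of classifiers that voted for it."""
    counts = Counter(predictions)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(predictions)


# Hypothetical outputs of three per-modality classifiers for one new subject.
outputs = {"sMRI": "SCHZ", "fMRI": "SCHZ", "genetic": "HC"}
label, share = majority_vote(list(outputs.values()))
print(f"ensemble decision: {label} ({share:.0%} of votes)")
```

Using an odd number of classifiers for a two-class problem avoids ties, which this sketch
would otherwise resolve arbitrarily in favor of the label encountered first.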

The application of machine learning methods for the purposes of di- 
agnosing or making a prognosis in schizophrenia has already demon- 
strated very encouraging results. The main advantage of machine 
learning methods, over standard univariate ways of analyzing and 
interpreting neuroimaging data, is that they may allow inferences to 
be made at subject-level, a feature essential in clinical practice. There
are, however, important difficulties yet to be fully considered and
overcome before their translation into routine clinical practice. The optimal
means of multi-center analyses, fusing imaging modalities and integrating
various sources of information are critical considerations. Finally,
once suitable techniques have been developed, they will ideally need
to be tested, preferably in randomized controlled trials, to ensure that
they are acceptable and useful to clinicians and patients.